{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Power Law Distributions\n",
    "\n",
    "$$ p(x) = cx^{-\\alpha} $$\n",
    "$$ \\text{a.k.a: } \\log(p(x)) = -\\alpha\\log(x) + c $$\n",
    "\n",
    "## Previously on this program: \n",
    "Last time (see 'long-tail-distributions-001') I talked about power law distributions, I talked about them in the context of long tail distributions as a whole. Before launching into a discussion of the power law distributions, I want to reiterate that *power laws are just **one** type of long tail distribution*. \n",
    "- A long tail distribution is *any* distribution with more weight in the tail than the exponential distribution\n",
    "- One other example of a commonly occurring long tail distribution is the log-normal distribution\n",
    "- Many empirical distributions are long tail for only some parts of the domain\n",
    "\n",
    "## Today:\n",
    "- Define a power law distribution\n",
    "- Show some effective ways of examining a power law distribution\n",
    "- Demonstrate a commonly proposed generative process for power laws found in the world\n",
    "\n",
    "## Goals:\n",
    "Long tail distributions, and even power law distributions, show up with some regularity in empirical Twitter data. Distributions such as \"the number of likes a Tweet recieves\" seem to follow a long tail distribution (possibly even explicitly a power law distribution).\n",
    "\n",
    "Understanding how to recognize long tail distributions, how to represent them, how to test for them, and where they might be coming from could help us indentify underlying processes for the emergence fo certain patterns in Twitter data.\n",
    "\n",
    "## Credit where credit is due:\n",
    "Today's RST is brought to you by the in part by the lovely Dr MEJ Newman, and his 2006 paper: *Power laws, Pareto distributions and Zipf's law*. This notebook is by me, Fiona Pigott."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Plotting library\n",
    "import matplotlib.pyplot as plt\n",
    "%matplotlib inline\n",
    "\n",
    "# mathematics\n",
    "from math import exp, pi, sqrt, log\n",
    "from numpy import linspace, hstack, round, random, arange, mean, logspace, argmax, array\n",
    "from collections import Counter\n",
    "from random import sample, choice\n",
    "from random import uniform\n",
    "#from functools import reduce"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Power law: \"*y* varies as a function of some power ($\\alpha$) of *x*\"\n",
    "\n",
    "$$ p(x) = cx^{-\\alpha} $$\n",
    "$$ \\text{a.k.a: } \\log(p(x)) = -\\alpha\\log(x) + c $$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "def power_law(x, alpha):\n",
    "    '''\n",
    "    x: iterable of x values\n",
    "    alpha: parameter alpha\n",
    "    '''\n",
    "    return [point**(-alpha) for point in x]\n",
    "\n",
    "# x values\n",
    "x = linspace(1,5,10000)\n",
    "alphas = linspace(1,3,5)\n",
    "for alpha in alphas:\n",
    "    _ = plt.plot(x, power_law(x, alpha), '-', markersize = 1)\n",
    "    _ = plt.ylabel(\"Probabilty mass at x\")\n",
    "    _ = plt.xlabel(\"x\")\n",
    "    _ = plt.title(\"Power Law Distribution\")\n",
    "_ = plt.legend([\"alpha = {}\".format(round(a,2)) for a in alphas], loc = 1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "for alpha in alphas:\n",
    "    _ = plt.loglog(x, power_law(x, alpha), '-', markersize = 1)\n",
    "    _ = plt.ylabel(\"Probabilty mass at x\")\n",
    "    _ = plt.xlabel(\"x\")\n",
    "    _ = plt.title(\"Power Law Distribution\")\n",
    "_ = plt.legend([\"alpha = {}\".format(round(a,2)) for a in alphas], loc = 1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Sampling a power law distribution\n",
    "\n",
    "I'm going to generate a sample of power-law distributed data using a transformation method--by using a fuctional transformation on uniformly distributed random data to get power law distributed random data.\n",
    "\n",
    "- generate a random real number r uniformly distributed in the range $0 \\leq r < 1$\n",
    "- compute $x = x_{min}(1 -r)^{-1/(\\alpha -1)}$\n",
    "- $x$ is a power law distributed number in the rangoe $x_{min} \\leq x < \\infty$ with an exponent $\\alpha$\n",
    "\n",
    "I want to show how sampling a power law is subject to lots of noise--especially in the tail--and how to see that in your representations of data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# generate 100000 samples of a uniform distribution on 0,1\n",
    "xmin = 1\n",
    "r = random.uniform(0,1,10**6)\n",
    "# generate the power law distributed samples\n",
    "# set alpha\n",
    "alpha = 2.5\n",
    "power_law_samples = [xmin*(1-sample)**(-1/(alpha-1)) for sample in r]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Looking at a power law distribution\n",
    "\n",
    "From Newman: \n",
    ">A histogram of a quantity with a power-law distribution appears as a straight line when plotted on logarithmic scales. Just making a simple histogram, however, and plotting it on log scales to see if it looks straight is, in most cases, a poor way proceed.\n",
    "\n",
    "Simply plotting a histogram of an empirical distribution to determine whether or not it is power law distributed is \"a poor way to proceed\" because of the very noisy data in the low-frequency very large values in the tail of the distribution. Newman suggests a few different ways of looking at power law distribtuions:\n",
    "\n",
    "1. Simple histogram\n",
    "2. Histogram on logarithmic scales\n",
    "3. Histogram with logarithmic bin sizes\n",
    "4. A cumulative distribution, or rank-frequncy plot (we looked at these in log-tail-distributions-001)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### A Simple histogram"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "def simple_histogram(samples, bin_sizes = None, num_bins = None):\n",
    "    # sort the samples\n",
    "    sorted_samples = sorted(samples)\n",
    "    # grab the max for the bins\n",
    "    max_val = max(samples)\n",
    "    # min is 0, but not inclusive\n",
    "    min_val = 0\n",
    "    # either create bins by interval or by size\n",
    "    if bin_sizes is not None:\n",
    "        intervals = arange(min_val,max_val+bin_sizes,bin_sizes)\n",
    "    if num_bins is not None:\n",
    "        intervals = linspace(min_val,max_val,num_bins)\n",
    "    # list of tuples defining the bins\n",
    "    list_bins = list(zip(intervals,intervals[1:]))\n",
    "    dict_bins = {x: 0 for x in list_bins}\n",
    "    # put each sample into a bin\n",
    "    i = 0\n",
    "    for s in sorted_samples:\n",
    "        if (s > list_bins[i][0]) and (s <= list_bins[i][1]):\n",
    "            dict_bins[list_bins[i]] += 1\n",
    "        else:\n",
    "            while not ((s > list_bins[i][0]) and (s <= list_bins[i][1])):\n",
    "                i += 1\n",
    "            dict_bins[list_bins[i]] += 1\n",
    "    return dict_bins\n",
    "\n",
    "def plot_histogram(histogram):\n",
    "    histogram_items = sorted(list(histogram.items()),key = lambda x:x[0])\n",
    "    x_vals = [mean([x[0][0],x[0][1]]) for x in histogram_items]\n",
    "    y_vals = [x[1] for x in histogram_items]\n",
    "    return x_vals, y_vals"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "x,y = plot_histogram(simple_histogram(power_law_samples,num_bins = 100))\n",
    "_ = plt.plot(x,y,'.')\n",
    "_ = plt.title(\"A simple histogram\")\n",
    "_ = plt.ylabel(\"# of samples\")\n",
    "_ = plt.xlabel(\"x\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# A histogram on a log-log scale\n",
    "\n",
    "From Newman:\n",
    ">  Notice how noisy the results get in the tail towards the\n",
    "right-hand side of the panel. This happens because the number of samples in the bins becomes small and statistical fluctuations are therefore large as a fraction of sample number. The power-law distribution dwindles in this region, meaning that each bin only has a few samples in it, if any.\n",
    "\n",
    "Noise in the tail is one of the reasons that a log-log histogram isn't always the clearest way to plot a power law distribution. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "x,y = plot_histogram(simple_histogram(power_law_samples,num_bins = 100))\n",
    "_ = plt.loglog(x,y,'.-')\n",
    "_ = plt.title(\"Log-log histogram\")\n",
    "_ = plt.ylabel(\"# of samples\")\n",
    "_ = plt.xlabel(\"x\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# A log-log histogram with logarithmic binning\n",
    "\n",
    "From Newman:\n",
    "> An alternative solution is to vary the width of the bins in the histogram [each bin is a fixed multiple wider than the one before it]. If we are going to do this, we must also normalize the sample counts by the width of the bins they fall in."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "def log_histogram(samples, num_bins = None):\n",
    "    # sort the samples\n",
    "    sorted_samples = sorted(samples)\n",
    "    # grab the max for the bins\n",
    "    max_val = max(samples)\n",
    "    # min is 0, but not inclusive\n",
    "    min_val = 0\n",
    "    # create logarithmically spaced bins\n",
    "    intervals = logspace(min_val,log(max_val+10,10),num_bins,base = 10)\n",
    "    #return intervals\n",
    "    # list of tuples defining the bins\n",
    "    list_bins = list(zip(intervals,intervals[1:]))\n",
    "    dict_bins = {x: 0 for x in list_bins}\n",
    "    # put each sample into a bin\n",
    "    i = 0\n",
    "    for s in sorted_samples:\n",
    "        if (s > list_bins[i][0]) and (s <= list_bins[i][1]):\n",
    "            dict_bins[list_bins[i]] += 1\n",
    "        else:\n",
    "            while not ((s > list_bins[i][0]) and (s <= list_bins[i][1])):\n",
    "                i += 1\n",
    "            dict_bins[list_bins[i]] += 1\n",
    "    return dict_bins\n",
    "\n",
    "def plot_log_histogram(histogram):\n",
    "    histogram_items = sorted(list(histogram.items()),key = lambda x:x[0][0])\n",
    "    x_vals = [10**mean([log(x[0][0],10),log(x[0][1],10)]) for x in histogram_items]\n",
    "    y_vals = [x[1]/(x[0][1] - x[0][0]) for x in histogram_items]\n",
    "    return x_vals, y_vals"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "x,y = plot_log_histogram(log_histogram(power_law_samples,num_bins = 100))\n",
    "_ = plt.loglog(x,y,'.')\n",
    "_ = plt.title(\"A log-log histogram with logarithmic binning\")\n",
    "_ = plt.ylabel(\"# of samples\")\n",
    "_ = plt.xlabel(\"x\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Plotting the cumulative distribution"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def cumulative_dist(samples,bin_sizes = None, num_bins = None):\n",
    "    # count the number of samples with a value greater than X\n",
    "    # sort the samples\n",
    "    sorted_samples = array(sorted(samples))\n",
    "    # grab the max for the bins\n",
    "    max_val = max(samples)\n",
    "    # min is 0, but not inclusive\n",
    "    min_val = 0\n",
    "    # either create bins by interval or by size\n",
    "    if bin_sizes is not None:\n",
    "        intervals = arange(min_val,max_val+bin_sizes,bin_sizes)\n",
    "    if num_bins is not None:\n",
    "        intervals = linspace(min_val,max_val,num_bins)\n",
    "    # put each sample into a bin\n",
    "    cum_dist = []\n",
    "    for i in intervals:\n",
    "        cum_dist.append(len(sorted_samples) - argmax(sorted_samples > i))\n",
    "    return intervals, cum_dist"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "x,y = cumulative_dist(power_law_samples,num_bins = 100)\n",
    "_ = plt.loglog(x,y,'.')\n",
    "_ = plt.title(\"Cumulative distribution\")\n",
    "_ = plt.ylabel(\"# of samples > x\")\n",
    "_ = plt.xlabel(\"x\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Estimating $\\alpha$ from the sampled data\n",
    "\n",
    "A simple method of estimating $\\alpha$ is the following:\n",
    "$$ \\alpha =  1 + \\sum_{i = 1}^n \\ln(\\frac{x_i}{x_{min}})^{-1} $$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "xmin = min(power_law_samples)\n",
    "est_alpha = 1 + (len(power_law_samples) * sum([log(x/xmin) for x in power_law_samples])**(-1))\n",
    "print(\"The estimated alpha is: {}\".format(est_alpha))\n",
    "print(\"The actual alpha is: {}\".format(alpha))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## The Yule process (a.k.a: \"preferential attachment\")\n",
    "\n",
    "From Newman: \n",
    "> A rich-get-richer mechanism in which the most populous cities or best-selling books get more inhabitants or sales in proportion to the number that they already have.\n",
    "\n",
    "I'm particularly interested in this generative process because it seems like it could be a good hypothesis for explaining Twitter platform distributions like \"the number of Likes/RTs per Tweet.\"\n",
    "\n",
    "In networks, this process is commonly labeled as \"preferential attachment.\"\n",
    "\n",
    "There is a functional form for the distribution resulting from the Yule process, but I don't think that's as informative as seeing the process in action. Let's simulate a simple rich-get-richer process."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# simulate the yule process\n",
    "# start with, say, a set of 100000 nodes, with a single link\n",
    "# I'm only interested in counting degree distributions, so I will simply record the links by \n",
    "# augmenting the degree count of 2 nodes by 1\n",
    "N = {x:0 for x in range(0,10**6)}\n",
    "# we'll do this 100,000 - 1 times:\n",
    "#    in order, each node gets a single link\n",
    "#    the probability that any link connects to an existing node is proportional to that node's degree\n",
    "#    note: the probability of a node with no links getting a new link is 0\n",
    "in_network_nodes = [0]\n",
    "nodes_by_degree = [0]\n",
    "for x in range(1,10**6):\n",
    "    in_network_nodes.append(x)\n",
    "    N[x] += 1\n",
    "    gets_a_link = choice(nodes_by_degree)\n",
    "    N[gets_a_link] += 1\n",
    "    nodes_by_degree.extend([x,gets_a_link])\n",
    "# the probability that the new link connects to an existing node is proportional to that node's degree"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "x,y = plot_histogram(simple_histogram(list(N.values()),num_bins = 1000))\n",
    "_ = plt.loglog(x,y,'.')\n",
    "_ = plt.title(\"Log-log histogram distribution\")\n",
    "_ = plt.ylabel(\"# of samples > x\")\n",
    "_ = plt.xlabel(\"x\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "xmin = min(list(N.values()))\n",
    "est_alpha = 1 + (len(list(N.values())) * sum([log(x/xmin) for x in list(N.values())])**(-1))\n",
    "print(\"The estimated alpha is: {}\".format(est_alpha))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
